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Abstract 


In many contexts, there is interest in selecting the most important variables from 
a very large collection, commonly referred to as support recovery or variable, fea¬ 
ture or subset selection. There is an enormous literature proposing a rich variety of 
algorithms. In scientific applications, it is of crucial importance to quantify uncer¬ 
tainty in variable selection, providing measures of statistical significance for each 
variable. The overwhelming majority of algorithms fail to produce such measures. 
This has led to a focus in the scientific literature on independent screening meth¬ 
ods, which examine each variable in isolation, obtaining p-values measuring the 
significance of marginal associations. Bayesian methods provide an alternative, 
with marginal inclusion probabilities used in place of p-values. Bayesian vari¬ 
able selection has advantages, but is impractical computationally beyond small 
problems. In this article, we show that approximate message passing (AMP) and 
Bayesian compressed regression (BCR) can be used to rapidly obtain accurate 
approximations to marginal inclusion probabilities in high-dimensional variable 
selection. Theoretical support is provided, simulation studies are conducted to as¬ 
sess performance, and the method is applied to a study relating brain networks to 
creative reasoning. 


1 Introduction 

In many contexts, there is interest in selecting important variables from a very large collection. Think 
for instance of gene expression data or neuroimaging data, where the number of potential features 
(predictors) exceeds the sample size. To deal with this problem, many variable selection methods 
have been developed that scale well to large numbers of variables, with Lasso/Li-penalization pro¬ 
viding one example. A major downside of many of these fast methods is that they do not provide 
a well-defined measure of statistical significance. Being able to evaluate statistical evidence about 
whether a variable should be included or not is however crucial in making informed variable se¬ 
lection decisions. In fact, in many biomedical applications, the main emphasis is on identifying 
which variables should be included, while reporting the level of evidence in the data that these are 
important variables; prediction is not directly of interest. 

Modeling the data in a Bayesian fashion provides a natural framework to evaluate statistical evidence 
via the posterior. Even though many Bayesian variable selection methods exist m, they typically 
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rely on Monte Carlo sampling for inference El 13111, which does not scale well with the number 
of candidate predictors. This led us to develop a general approximation framework for marginal 
posteriors in Bayesian linear regression, which can handle large numbers of candidate predictors 
by treating all but one of their coefficients as nuisance parameters and integrating them out (ap¬ 
proximately) from the likelihood Q. As an example, we will focus on the spike-and-slab prior for 
variable selection. Our method provides an estimate of the posterior probability of inclusion for 
each potential predictor. 

Our contributions can be summarized as follows: 

• This paper presents a novel framework for marginal posterior approximation which is scal¬ 
able and can handle highly non-Gaussian prior and posterior distributions. 

• It is shown how two state-of-the-art methods — BCR and AMP — can be used within 
the marginal approximation framework by providing an approximation of the posterior 
predictive distribution of rotated data. 

• The framework is applied to the problem of Bayesian variable section with a spike-and-slab 
prior on both simulated data and a real study relating brain networks to creative reasoning. 


1.1 Bayesian Linear Regression 


Consider the standard linear regression model, 

y = XP + e, e ~A/'(0, cr^/„), 


( 1 ) 


where X is a fixed n x p matrix of features, /? is a p x 1 vector of unknown coefficients, y is an 
n X 1 response vector, e is an n x 1 residual vector and cr^ is the error variance. Assuming that j3 is 
drawn according to a prior distribution tt, the posterior distribution obeys 

p(/3|t/) cxexp(-2^||y-X/3f)7r(/3). (2) 

In practice, the joint posterior is difficult to visualize and interpret, and one routinely bases infer¬ 
ences on summaries of marginal posterior distributions for univariate functionals of the parameters. 
The marginal posterior distribution of coefficient jSj is obtained by marginalizing out the other 
coefficients P{-j), and is given by 


p{l3j\y) 



(3) 


The posterior predictive distribution is the distribution of the response j/new to a new vector a;new 
conditional on having observed {y,X), and is given by 

piynew\y) OC / exp (- 2 ^(t/new “ P{l3\y)dp. (4) 

JRP 

As expressed above, both the posterior marginal distribution and the posterior predictive distribu¬ 
tion require computing a high-dimensional integral over the posterior distribution. These integrals 
are challenging to compute in general, a massive literature has focused on scalable approximation 
methods. 


1.2 Bayesian Variable Selection Using a Spike-and-Slab Prior 

The problem of variable selection is to identify which entries of the coefficient vector are nonzero. 
A standard approach in Bayesian variable selection is to assign a spike-and-slab prior on the coeffi¬ 
cients having the form 

/3j ~ (1 - A)5o + AA/'(O,'0), j = l,...,p, (5) 

where (5o is a point mass at zero, ip) ^ Gaussian distribution with mean zero and variance ip, 
and A G (0,1) is the prior inclusion probability. 

Let 7 be a binary p x 1 vector where jj is an indicator of the event {f3j ^ 0}. The posterior marginal 
inclusion probability of the jth coefficient is given by 

.—ft 

(1 - X)p{y\ij = 0) -f Ap(t/| 7 j = 1) 
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<^>^ = X^X^^ + a^I, ( 7 ) 


where p{y\jj) can be expressed explicitly as: 

p{yhj)^ AA(2/|0,<i>y)A'=(^')(l-Ari-'=(^'), 

with fc( 7 ') = 'y'j'- Note that the marginal inclusion probabilities are available in a simple 

analytic form involving evaluation of normal densities and simple weights. However, the complexity 
of the summation grows exponentially in p and thus direct computation is infeasible for large p. 

MCMC and related sampling algorithms have been employed for posterior inference Eimi. Such 
algorithms attempt to sample efficiently from the massive dimensional space of 2^ possible models; 
for p even moderately large (in the 100s to 1000s) the space is so huge that there is no hope of 
visiting more than a vanishingly small proportion of the models. For example, « 1.27 x 10^°. 
This leads to high Monte Carlo error in estimating posterior model probabilities, with almost all 
models estimated to have zero probability (as they are never visited). The topology of the model 
space makes efficient computation even more challenging, with local regions containing “good” 
models often separated by large regions containing relatively poor models. Sampling algorithms 
have trouble efficiently moving between these isolated regions. This has motivated a rich literature 
on better samplers 0121 . However, one can argue that sampling is intrinsically intractable (e.g., 
even the best samplers can find a much better model after 10 million iterations), motivating our fast 
approximation approach. 

2 Posterior Marginal Approximation with IID Priors 

This section describes our framework for approximation of the posterior marginal distributions for 
an arbitrary iid prior on the coefficient vector /3. 

1. The first step is to apply a rotation to the observed data that decouples the dependence between an 
unknown coefficient of interest and the other unknown coefficients, which are viewed as nuisance 
parameters. This leads to a representation of the marginal posterior in terms of the posterior 
predictive distribution of a modified linear regression problem. 

2. The second step is to replace the posterior predictive distribution obtained in the first step with a 
(non-standard) Gaussian approximation. Interestingly, this approximation can be highly accurate 
even if the prior and posterior are highly non-Gaussian. Recent techniques in the literature are 
used to compute the mean and variance. 

2.1 Connection Between Posterior Marginal and Posterior Prediction 

This section describes how the posterior marginal distribution of the jth unknown coefficient can be 
expressed in terms of the posterior predictive distribution of a rotated regression problem. 

For a fixed index j, consider the rotated data (z, y) defined by 

z = q'[y, y = Q^y, ( 8 ) 

where qi = Xj7||a;j|| is the unit vector in the direction of the jth column of X and Q 2 is an nx (n— 1) 
matrix chosen arbitrarily subject to the constraint that Q 2 Q 2 = In — ■ Since the n x n matrix 

Q = [ 91 IQ 2 ] is full rank, the mapping between y and {z,y) is one-to-one. To characterize these 
terms, we introduce the notation 

^ II; ^new — Ql ^ — ^2 ^new — ^ dj^j. 

The first three terms are functions of X and the last term is an auxiliary variable that cannot be 
observed directly, since j3j is unknown. Following from the rotational invariance of the Gaussian 
distribution, the distribution of the rotated data can now be expressed as 

, (9) 

(^new/^C—i) f ) 5 (10) 

Z = (X^j + ^new ■ (11) 
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The important property of this decomposition is that y does not depend on fij. Thus, conditioned on 
z, the posterior predictive distribution p(ynew|2/) is a sufficient statistic for inference about f3j. This 
means that it is now sufficient to consider the scalar model 

Z — T ynew; ^ ^7 Vnew ^ f i (1^) 

where / denotes the posterior predictive distribution p{ynew\y)- The posterior marginal distribution 
of Pj obeys 

P{Pj\y) oc fiz - aPj)Tr{Pj). (13) 


2.2 Approximation of the Posterior Predictive Distribution 

The main challenge in using the formulation of the previous section to efficiently approximate 
the marginal posterior of Pj is that computation of the exact posterior predictive distribution is 
intractable. The key insight underlying our approach is that, in many cases of interest, the poste¬ 
rior predictive distribution can be well-approximated by a Gaussian density, even if the prior and 
posterior distributions of the unknown coefficients are highly non-Gaussian. 

To obtain an approximation of p{Pj\y), we propose to first obtain a Gaussian approximation / for 
the posterior predictive distribution /, and then plug this approximation into ( [T3] l to compute the 
approximation of the posterior on Pj , 

P{l^j\y) oc f{z - aPj)TT{Pj). (14) 

The approximation / has the form Af{p, r^), where the mean p and variance are functions of the 
data set [y, X, inew)- The problem of computing (^, t^) can be attacked adapting a variety of recent 
techniques in the literature. Two of these are described in Sec.[^ 


2.3 Approximation for Variable Selection 


We now show how our mar ginal approximation framework can be applied to the problem of variable 
selection described in Sec. 


1.2 


Let TT be the spike-and-slab prior in Q and let N{p,T^) be the 
approximation of /. The approximation for the posterior marginal distribution of Pj is then a spike- 
and-slab distribution of the form 


(1 - \j)5tj + XjM{mj,ipj), 


(15) 


where TOj = and Aj is the approximation of the posterior marginal inclusion 

probability; 


\N{z\p, aPij} + 

(1 — X)Pf{z\p, T^) + XAf{z\p, aPij) -I- T^) 


(16) 


2.4 Analysis of Framework 

A particularly useful property of our approximation framework is that the discrepancy between 
the posterior predictive distribution / and its approximation / does not depend on the unknown 
coefficient Pj whose posterior marginal distribution we are trying to compute. As a consequence, if 
/ converges to / under a suitable metric, then it follows under very weak assumptions on the prior 
TT that the posterior marginal approximation also converges to the true posterior marginal. 

As a heuristic justification for the Gaussian approximation of ynewly, consider a setting in which 
the entries of inew are of the same order. Then, the a priori distribution of jinew is approximately 
Gaussian by the central limit theorem for sums of independent variables. Provided that the entries 
of P(-j) given y are weakly correlated, it can then be argued that the posterior distribution of ynew is 
also approximately Gaussian. Using ideas from iliii, this line of reasoning can be made rigorous 
for certain classes of large random matrices X. It is important to note that approximate Gaussianity 
of the predictive distribution does not hold in the setting where a small number of other predictors 
are highly collinear with Xj. 
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Algorithm 1 Bayesian Compressed Regression (BCR) for posterior predictive distribution 
Input: data (y, X), new vector Xnew, BCR parameters {k, m, K). 

1. Run the BCR algorithm from lfT4ll for K random projections: 

1: for fc = 1,..., iC do 
2: Draw 0 ~ (^(0.1, 0.9) 

3: Sample each element in the p x m matrix 0 from (— y/ljO, 0) with probabilities 

(02, (1 - 0)2,2(1 - 0)0) but such that 0 is full rank. 

4: Orthonormalize 0 using a Gram-Schmidt process. 

5: Compute the predictive mean and variance of new response j/new based on BCR with 

projection matrix 0 according to 

fife = X'^ XQ + (cr2/K)/„)"^0'^A^y, 

tI = Xnew0(e'^x^xe + ((7^/K)Im)~^0'^X^^„ + (T^. 

6: Compute the model weight Wk which equals the marginal likelihood of the linear regres¬ 

sion model y ^ N(XQa, a'^In) with prior a ^ A/’(0, nlm) 

2. Model average the means and variances of new response j/new according to 

E K 2 2 

k=i wkf^k, = L/c=i wkn- 

Return: approximate posterior predictive distribution JV{ynew\l^, x^). 


In contrast to many of the existing approximation methods, our framework can handle posteriors 
which are multimodal and posteriors which are discrete-continuous mixtures. This is not possible 
using methods based on direct normal-type approximations or Laplace’s method ifTOlfTTlfT^ . Also, 
we note that previous work has shown how confidence intervals can be obtained for various M- 
estimators E). Our work differs in that we can handle an arbitrary prior distribution, our two- 
stage procedure decouples the interaction between the coefficient of interest and the approximation, 
and our framework permits the use of a variety of methods to compute the posterior predictive 
distribution. 


3 Methods for Posterior Predictive Approximation 


The framework described in Sec. [^requires the approximation of the posterior predictive distribution 
of a rotated linear regression problem. This section shows how two recent methods — BCR and 
AMP — can be used to obtain this approximation. Both of these methods are scalable and have 
theoretical performance guarantees in the high-dimensional setting. 

Throughout this section, the methods are described in the context of the usual posterior predictive 
distribution problem given in Sec. o For the purposes of our approximation framework, however, 
it is important to remember that these methods are not applied to the original data {y, A), but rather 
to the rotated data (y, X, Xnew) defined in Sec. 


3.1 Bayesian Compressed Regression 

BCR is inspired by data squashing and compressive sensing but is fundamentally different as it 
reduces only the number of predictors but not the sample size m. It computes the exact posterior 
after randomly projecting the nxp matrix X of features, using a random p x m (m < p) compression 
matrix 0, to the compressed n x m matrix with compressed features A0. The coefficients /3 are 
then replaced with an m-dimensional vector a, which is assigned a normal prior. We now have the 
likelihood y ^ M{XQa, a'^In)- Conditional on the projection, this then readily yields as posterior 
predictive for t/newly ^ normal distribution / = r^). See Algorithm for the details, using the 

prior a ^ Af(0, Kim) and the random projection from lfT4l . As in mi, we model average over K 
different random projections. 


2.1 
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Algorithm 2 Approximate Message Passing (AMP) for posterior predictive distribution 
Input: data (y, X), new vector Xnew, prior distribution tt, error variance cr^. 

1. Run sum-product approximate message passing with optimal nonlinearity defined by the prior 
TT and obtain point estimates of the posterior mean m and posterior marginal variance v of the 
regression coefficients. For full details, see e.g. El Algorithm 1] or im Sec. 2.5 ]. 

2. Compute the mean and variance of new response t/new according to 

H diag(u)a:new + cr^- 

Return: approximate posterior predictive distribution A^{yRew\^J■, x"^)- 


For the spike-and-slab prior in ([^, we will set k = tjj while m can be chosen based on A. To see the 
latter, the prior on a induces a singular prior on /3 that lives in an m-dimensional hyperplane in 
so a lower m implies a more restricted /3 which is a bit similar to using a smaller A in (|^. Secs.g 
and ^use K = 10. 01 compressed all features to obtain a computationally efficient approach to 
prediction without the possibility of variable selection; here, we use BCR in a fundamentally new 
and different manner by leveraging the posterior approximation framework from Sec.|^ 

The theoretical support for BCR applies to the high-dimensional setting where both n and p tend 
to infinity with p < exp{6n‘’) for 5 > 0 and ( G (0,1). In this setting, concentration of the pos¬ 
terior approximation / is shown in ifTl under some regularity conditions on X and (3. Combining 
this result with Theorem 1 in ITSl shows that, under the spike-and-slab prior, the approximation / 
converges in probability to the true predictive distribution /. 

3.2 Approximate Message Passing 

The second method we consider is approximate message passing (AMP) EllEl. This algorithm 
is based on Gaussian and quadratic approximations of loopy belief propagation in graphical models. 
It can also be viewed as a forward-backward primal-dual method for minimizing an approximation 
to the large-system-limit Bethe free energy ifTSl . The algorithm is defined by scalar denoising func¬ 
tions, and optimal performance is obtained when this function is matched to the prior distribution of 
the coefficients ifT^ . 

For certain classes of large random matrices X the behavior of AMP can be characterized rigorously 
as a function of the prior distribution on the coefficients Ha. Furthermore, in these settings, 
simulations show that the algorithm converges rapidly and is often much faster than other state-of- 
the-art optimization methods. For general matrices, however, analysis of AMP is more challenging. 
In practice, convergence of the algorithm may require dampening ifTSll or serial updates ll20ll . 

The output of AMP can be viewed as an approximation to the posterior marginal distribution of the 
coefficients, and is thus directly related the framework described in this paper. A key difference, 
however, is that we do not run AMP on the entire data. Instead, we first use AMP to obtain an 
estimate of the posterior predictive distribution of ynew and then combine this with the scalar mea¬ 
surement z. This two-stage procedure has the advantage that the first stage is independent of the 
coefficient of interest. A high-level overview of our implementation of AMP for approximation of 
the posterior predictive distribution is given in Algorithm 2. 

The theoretical support for AMP applies to the high-dimensional setting where both n and p tend to 
infinity with n/p -G 6 for a fixed ratio 5 G (0, oo). Then, under the assumption that the entries of 
X are iid zero-mean Gaussian variables, it follows from results in 13 [l9l that the approximation / 
converges in probability to the true predictive distribution /, provided that S > So where Jq G (0,1) 
depends on the prior tt and the error level cr^. 


4 Simulation Studies 

To test our methods to approximate posterior inclusion probabilities, we apply them to an ex¬ 
ample where we can compute the exact inclusion probabilities. Consider the linear model y ^ 
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Figure 1; Average (dot) and 20th to 80th percentile (line) of the MSE of the posterior inclusion 
probabilities estimates by the AMP and BCR algorithm in red on the left and blue on the right, 
respectively. 


Af{Xj3, with n = 100, the p = 12 parameters /3 = (3,1.5, 2, 0, 0,0,0, 0, 0, 0,0,0)^ and the 
matrix with features X such that its columns are identically normally distributed with correlation 
between column i and j and the elements within a column are iid, similar to Example 1 from 
Ea. According to this, we generate 100 data sets for each p = 0, 0.1,..., 0.9 with such that the 
signal-to-noise ratio equals 2 and we set ijj = lOa^. Then we compute the exact posterior inclusion 
probabilities and run our algorithm with AMP and with BCR (m = 5) where A = 3/12. Summaries 
of the MSE of the inclusion probabilities provided by our algorithms compared to the true posterior 
inclusion probabilities are shown in Eig. [T] We see that both approximation methods do a good job 
for p = 0. As the collinearity in X increases, the MSE increases as well but is still sufficiently small 
for the inclusion probability estimates to be meaningful. 

In practice, and A are parameters that need to be tuned. Eor the application in the next section, we 
build this tuning into our algorithms. Eor AMP, we use iterative updating of very similar to what is 
proposed in 1^ and also update A analogously based on the current inclusion probability estimates 
at each iteration. BCR allows for marginalizing out cr^ M- Eor this, we use a cr^ ^ 2it/(3,1) 
prior after standardization of both X and y. We then iterate the BCR algorithm where each time 
we update A in the same manner as for the AMP algorithm. Convergence of A required only a few 
iterations. 
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(b) p = 0.2, 
SNR= 1 


(c) p = 0.5, 
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(e) p = 0.8, 
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Eigure 2: Results for BCR with cr^ and A unknown and with j the index of the parameter. 


To check whether these additions to our methods are valid, we created the box plots in Eigs. 
and each based on 200 simulated data sets. The design matrix is generated as earlier in this 
section but now p = 7 with /? = (3,1.5, 2, 0, 0,0, 0)^. Note that AMP is more aggressive than 
BCR in setting inclusion probabilities to the extreme values 0 and 1. Eurthermore, AMP struggles 
with p = 0.7,0.8. The results show that our methods can provide meaningful posterior inclusion 
probabilities even when A and cr^ are unknown. 
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Figure 3; Results for AMP with cr^ and A unknown and with j the index of the parameter. 




(b) AMP 


Figure 4: Horizontal section of a weighted brain network visualization where the weights are given 
by the posterior inclusion probability of each edge with red to white indicating low to high. For each 
algorithm, only the 60 edges with the highest inclusion probabilities are plotted. 


5 Applications 


This section applies our approximation framework to neuroscience data. We use the brain network 
data from ll^ which is available as the MRN-111 data set from http : //openconnecto .me/ 
data/public/MR/MIGRA.INE_vl_0/ These are connectomes with counts of the number of 
connections between the 70 different brain regions from the Desikan atlas ll25l . The dependent 
variable is the composite creativity index (CCI) for n = 113 persons from ll24l . Out of all pairs 
of brain regions, 1802 have a connection for at least on person. The connection counts for these 
p = 1802 pairs form the linear predictors in our model. These counts are zero for half of these pairs 
with a mean of 1477 and a maximum of 58090. 

See Fig. 1^ for a summary of the inclusion probability estimates provided by AMP and BCR (m = 
20). It is noteworthy that Fig.|^has similarities with Fig. 10 from ESI which considers the same 
data set but using a Bayesian nonparametric model that is substantially more complex. For instance, 
both contain edges from node 18L to 3R, 18R and 20R. We also note that this is a highly challenging 
example due to the ill-conditioned, discrete and heavy-tailed nature of the feature matrix. This likely 
leads to some of the differences between the BCR and AMP-based approaches, though both tend to 
include more cross-hemisphere connections, which is consistent with previous evidence that more 
creative individuals have more connections between the right and left hemispheres. 
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6 Discussion 


Sec. [^presented a novel general framework for marginal posterior approximation. Via a rotation, 
the parameter of interest and the other ‘nuisance’ parameters are separated in the likelihood. This 
reduces the p-dimensional problem to a scalar one dependent on the influence of the nuisance pa¬ 
rameters. This influence, summarized in a posterior predictive, appears to be well approximated by 
a Gaussian for large p even when the full posterior is far from Gaussian. Sec. [^provided BCR and 
AMP as state-of-the-art methods for approximating this posterior predictive. We then focused on 
the spike-and-slab prior but note that the framework readily applies to many iid priors on (3. The 
first simulation in Sec.|4]showed that the framework both with BCR and AMP is able to estimate the 
posterior inclusion probabilities accurately. 

The proposed approach represents a substantial paradigm-shift in methods for estimating marginal 
posterior distributions in variable selection. There is an enormous literature proposing a wide va¬ 
riety of carefully designed sampling algorithms; the proposed novel framework leads to order of 
magnitude speed ups and improvements in stability. This same approach should be applicable much 
more widely than the Gaussian linear regression setting considered here. 
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